Adaptive sampling by information maximization 
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The investigation of input-output systems often requires a sophisticated choice of test inputs to 
make best use of limited experimental time. Here we present an iterative algorithm that continuously 
adjusts an ensemble of test inputs online, subject to the data already acquired about the system 
under study. The algorithm focuses the input ensemble by maximizing the mutual information 
between input and output. We apply the algorithm to simulated neurophysiological experiments 
and show that it serves to extract the ensemble of stimuli that a given neural system "expects" as 
a result of its natural history. 

PACS numbers: 87.10.+e, 89.70.+C, 07.05.-t, 87.19.-j 



Biophysical systems often have many degrees of free- 
dom and thus one needs large numbers of variables and 
parameters to describe them. Without strong prior 
knowledge about the intrinsic dynamics of such a sys- 
tem, one is left with inferring its function from data 
obtained by experiments or observations. Given a sys- 
tem where we control a set of "input" variables x — 
(x^\ x^ 2 \ . . . ,x^ n ') and measure another set of "output" 
variables y — (y^^y^ 2 ', ■ ■ ■ , j/™- 1 ), we can actively ma- 
nipulate the data acquisition by selecting the most infor- 
mative test inputs. Yet how should one choose the test 
inputs to learn most about the input-output relation? 

Within the classical Volterra- Wiener system identifi- 
cation methods B, the input space is sampled by draw- 
ing inputs from a probability distribution p(x)} a com- 
mon choice is Gaussian "white noise". However, not 
all aspects of the system's input-output relation may be 
equally important. In neurobiology, for instance, one is 
especially interested in inputs x about which a given sen- 
sory system conveys most information. In the spirit of 
importance sampling H, one might therefore focus the 
data acquisition on those x that contribute most to the 
information transfer. For a given input distribution, the 
information provided by a single input can then be quan- 
tified as I(x) = H y — H y (x) where H y is the entropy of 
the output distribution p(y) and H y (x) is the entropy 
of the conditional probabilities p{y\x) which characterize 
the input-output relation || ^j. Hence, the appropriate 
focusing is achieved by an input distribution p opt (x) that 
maximizes the mutual information I = (I(x)} where the 
angular brackets denote averaging over p opt (x) . 

Without any information about the system and its 
input-output relation, the optimal input distribution 
Popt (a;) is unknown. Any experimental test of the system 
must therefore start with drawing the test inputs from 
some predefined distribution p$(x) that depends on a set 
of parameters <ft — (0 < - 1 \ . . . , </>^). Once data about the 
system has been acquired, however, one need not adhere 
to this initial choice of an input distribution. Instead, 
one should adapt the parameters or even the structure 



of Ptj,(x) to better focus on the important inputs. In 
this letter, we show how to systematically perform this 
adaptation. By iterating the adaptation procedure, the 
acquired data becomes ever more useful and the input 
distribution approaches the optimum. 

Adapting the input distribution — For mathematical 
simplicity, we assume that both input and output take 
discrete values. Say that we have already tested the sys- 
tem with N different inputs x% each of which was pre- 
sented Mi times while measuring the outputs with i = 
1 . . . N and j = 1 . . . Mi . We define the set of all differ- 
ent output values measured so far by {yk '■ k — 1 ... if}. 
Our present knowledge about the system is summarized 
by the conditional probability that an output y& was ob- 
tained from the input Xi, 
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The estimated probabilities q(yk\%i) allow us to re- 
evaluate the relative importance of the inputs Xi in terms 
of their potential contribution to the mutual information. 
To measure this contribution, we assign a probability or 
"weight" q{xi) to every input. Initially we assume that 
all inputs Xi contribute equally and set q%{xi) = 1/N. 
To find a combination of weights that maximizes the in- 
formation transfer, we use the Blahut-Arimoto algorithm 
H and readjust the weights, 
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ization constant so that J2 i= i Q: 



< and Z is a normal- 
i(xi) = 1. Accord- 
ing to Eq. (|2|), the weight of an input Xi is decreased 
if its conditional output distribution q(yk\%i) is similar 
to the total output distribution q n {yk)- In contrast, the 
weight of an input Xi is increased if the respective dis- 
tributions differ. When Eq. (0) is iterated, the weights 
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converge and reach a global maximum of the mutual in- 
formation . In practice, we terminate the process once 
1 — q n +i(xi) I 'q n (xi)\ < e for all i and some chosen preci- 
sion e and set q op t(xi) = QW+i (#,*). 

The weights or probabilities q pt(xi) describe the rel- 
ative frequencies with which the respective inputs x% 
should be drawn. Consequently, we need to adapt the 
parameters <f> so as to find a matching distribution p^ (x) . 
Here we determine the new parameters q> by maximizing 
the log-likelihood function M, 
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q op t(xi) log p,p(xi) (3) 



where the probabilities q pt(xi) provide the appropriate 
weights. For some model distributions, e.g. Gaussians, 
the maximum can be found analytically. In general, how- 
ever, one has to evaluate the maximum numerically. 

The input distribution given by the new parameter val- 
ues can be used to draw new test inputs, present them 
to the system and measure the respective outputs. After 
a certain amount of data has been acquired, the param- 
eters 4> of the input distribution can be adapted again. 
The resulting iterative algorithm moves the input distri- 
bution towards an optimal ensemble. 

Model quality and convergence — Every maximum of 
the mutual information with respect to p(x) is a global 
maximum Q|. Hence, if the model distribution does not 
rule out any inputs, i.e., p${x) > for all x and </>, the 
estimates of the input-output relation, Eq. (|l]), converge, 
and therefore q op t(Xi) — * Popt(^i)- Accordingly, the mu- 
tual information 7d = (77* — H y {x)) q achieves the infor- 
mation capacity of the system; here, the index q denotes 
that the respective quantities and averages are calculated 
with respect to q op t(xi). 

The model distribution p r p(x) converges towards an op- 
timal fit of Popt(^)- To control how well the model dis- 
tribution captures the structure of the optimal distribu- 
tion, one can check the mutual information achieved by 
the model, 7m = (77^ — H y (x)) < j > , which is calculated with 

respect to r^(xj) = P<f,(xi)/[Y,j=i P<t>( x j)]- The fraction 7 
of the mutual information captured by the model is then 
defined as 



7m 

7 = T 



(4) 



and provides a measure for the quality of the model. 
Hence, if 7 falls significantly below one, the model does 
no longer capture the structure of the optimal ensemble; 
in such a case, one might increase the complexity of the 
model. 

In general, the algorithm will not be able to adapt the 
input ensemble if the presented inputs always result in 
the same output value. Similarly, there is no possibility 
to weight the inputs xi differently if every input elicits a 
new, different output. However, the latter problem can 
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FIG. 1: Approaching the optimal input ensemble of a neuron 
with static, one-dimensional input and output, (a) Plot of the 
conditional probability distribution p(C\I) with spike count C 
and input current I. The uncertainties at / ~ 22 /zA/cm 2 are 
due to a decline in spike size that makes it impossible to detect 
the spikes in the noisy voltage output. For / « 28 /lA/cm 2 , 
the model neuron ceases to generate spikes, (b) Approaching 
the optimal input distribution (bars). Shown are the initial 
distribution (1), the distributions of the iterations (2) and (4), 
as well as the final distribution (00). (Simulation parameters: 
n = 1, m = 1, L = 2, A = 10, B = 5, e = 0.1, a n = 4 /iA/cm 2 , 
/„ = 1000 Hz) 



be solved by discretizing the output side into a smaller 
number of possible outputs. The input space, on the 
other hand, can be discretized as fine as needed without 
impeding the convergence of p^x). 

Example — To illustrate the method, we study a numer- 
ical simulation of a Hodgkin-Huxley-type model neuron 
0. The model neuron transforms an input current 7 
into a voltage output V. For constant current values 
7 < (x A/ cm 2 , the voltage approaches a stable equi- 
librium. For current values 7 > /iA/cm 2 , the model 
undergoes a saddle-node bifurcation and generates peri- 
odically occurring action potentials, also called spikes || . 
Stochastic aspects of neural activity are incorporated by 
adding Gaussian white noise with a fixed standard devi- 
ation rjjj and a cut-off frequency f v to the input. 

We start with a simple one-dimensional parametriza- 
tion of input and output. The inputs are 100-ms- 
long, discretized current steps (A7 = 1 /iA/cm 2 ), 
restricted to a physiologically realistic range of 7 = 
— 12 ... 28 [iA/cm 2 . The outputs are given by the num- 
ber of spikes, C, during the corresponding time window. 
The resulting probabilistic relation of spike count versus 
current is displayed in Fig. |l|(a). 

For this one-dimensional input-output system, we can 
compute an exact solution of the information maximiza- 



3 



tion problem. The optimal input distribution p opt (I) is 
depicted by the vertical bars in Fig. |](b); the shape of 
Popt (I) corresponds to the slope of the input-output rela- 
tion H . Note that there is a slight increase in the proba- 
bilities of inputs far below threshold (I < —10 fiA/cm 2 ). 
These inputs result almost certainly in a zero spike count 
output. At the same time, inputs closer to threshold 
(I ss —9... — 1 /iA/cm 2 ) are more likely to produce 
spikes. As the optimal input distribution favors inputs 
that are more reliable, the inputs closer to threshold are 
neglected. 

To study the performance of the iterative algorithm, 
we model the optimal input distribution by a truncated 
Gaussian. As initial parameter values, we choose a mean 
tfft) = —10 /iA/cm 2 and a standard deviation = 
10 /iA/cm 2 . In each iteration, we draw A current values 
from the Gaussian, test them B times on the system, 
and adapt the parameters. For our Gaussian model, the 
maximum likelihood estimate of the new parameters is 
given by <j>W = Eili J i9o P t(ii) and ^ = [Y^Liik ~ 

The Gaussian model distributions are displayed in 
Fig. [l](b) for the first few iterations. Most of the cur- 
rent values drawn from the initial distribution fall be- 
low the spiking threshold of the neuron. Consequently, 
the algorithm shifts the Gaussian distribution into the 
spiking regime of the neuron. After about 10 iterations, 
the mutual information rate saturates at « 40 bits/sec. 
Since both the final Gaussian model p,p(I) and the op- 
timal input distribution p O pt(-0 lead to approximately 
the same information transfer, the landscape of the mu- 
tual information with respect to the input distribution 
is relatively flat around the maximum; it suffices if 
the input distribution covers the relevant input range 
(J r* ... 25 /iA/cm 2 ). Note, that due to the maximum- 
likelihood estimation, Eq. (||), the final Gaussian distri- 
bution has the same mean and variance as the optimal 
distribution. 

Multi- dimensional example — The computational 
power of the algorithm becomes clearly visible for 
high-dimensional input spaces. As an example, consider 
the above model neuron when the input consists of 
time-varying, statistically stationary currents, dis- 
cretized in time steps of Aii. Following (To), we slide 
overlapping windows of length T = nAti across the 
input current trace and use the values within each 
window as input vector Zj = (i^ ,1^ ). For each 

of these inputs the output Cy is given by the spike 
times, discretized in time steps of At2 = T/m, during 
the corresponding window. Hence, each input consists 
of n real-valued numbers bounded within the interval 
/ = —12... 28 /iA/cm 2 , and each output consists of 
m numbers whose values are either zero (no spike) or 
one (spike). Note, that we do not explicitly discretize 
the current values; we instead assume that every input 
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FIG. 2: Approaching the optimal input ensemble of a neu- 
ron with time-varying input and output, (a) Evolution of 
average and power spectrum, (b) Evolution of information 
rate and (c) model quality for three different initial condi- 
tions. (Simulation parameters: n = 64, m = 16, L = 33, 
A = 1000, B = 20, e = 0.1, cr„ = 4 /iA/cm 2 , /„ = 1000 Hz, 
Aii = 0.25 ms, At2 = 1 ms, T = 16 ms; windows slided by 
Ai2; accordingly, AAt2B x 100 iterations w 34 minutes) 

Ii is unique. For simplicity, we use a Gaussian input 
distribution. As the input is real and stationary, it 
suffices to use L = n/2 + 1 parameters for describing 
average and power spectrum of the current trace. 

To test the system, we choose an initial distribution 
with an average <jP^ = /iA/cm 2 and a flat power 

spectrum with standard deviation a = [ ( t )< '^] = 
10 nA/cm 2 . For this prior, only 50% of the input val- 
ues lie above threshold and the inputs will rarely lead 
to high firing rates, cf. Fig. [|. Consequently, we do not 
properly explore the full range of the input-output rela- 
tion; if, for example, we test the system for 30 minutes 
with input currents drawn from this initial distribution, 
the information rate Id does not exceed w 300 bits/sec. 

When using the iterative algorithm to adapt the pa- 
rameters of the input ensemble, on the other hand, the 
information rate 7d saturates around « 670 bits/sec af- 
ter about 20 minutes. Figure |^(a) shows how the power 
spectrum is shaped during the iterations. Only input 
frequencies below 500 Hz are well suited for the infor- 
mation transfer, the cut-off is roughly determined by the 
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maximum firing rate of the model neuron. The overall 
increase in power leads to input currents that override 
the additive noise rj of the model neuron. 

Initial conditions, convergence, and degeneracies — 
When the initial distribution is very narrow (flat power 
spectrum up to f c = fOOO Hz, with a = 1 //A/cm 2 , 
= 20 iiA/crn 2 , Fig. |(b), dotted line), most of the 
input currents drive the neuron maximally and thereby 
very reliably. The strong initial bias leaves the algo- 
rithm with little maneuvering space for the parameter 
re-estimation so that it takes longer to approximate an 
optimal input distribution. 

In the worst case, every input leads to the same output 
value. With the initial choice ef)' 1 ' = —6 //A/cm 2 and a = 
10 //A/cm 2 , the input does not elicit any spikes during 
the first iterations, cf. Fig. §(b), dashed line. However, 
once a spike has appeared, the statistics of the model 
distribution immediately moves into the direction of the 
statistics of the inputs /; that caused a spike. When the 
algorithm has tracked the relevant input range, a rapid 
increase of the information rate follows. 

In the examples studied, the mutual information 
reaches approximately the same value independent of the 
initial conditions, cf. Fig. ||(b). Although there is always 
a clear preference for frequencies below 500 Hz, however, 
the parameters of the optimal input ensemble do not con- 
verge to the same set of values. Consequently, there is 
no unique combination of parameters that maximizes the 
mutual information; an observation that generalizes be- 
yond the specific examples chosen. Nonetheless, the final 
input distributions always capture about 7 = 80% of the 
mutual information Id, cf. Fig. ^|(c). 

In general, there might be "degenerate" subsets in 
stimulus space, i.e., sets of stimuli that lead to the same 
output value. In these cases, the total probability as- 
signed to such a subset can be distributed in an arbi- 
trary way on the subset and any statistical parameters <fi 
that depend on these subsets can assume different val- 
ues without significant consequences for the information 
transfer. 

Neurophysiological interpretation — Recent studies in- 
dicate that sensory neurons convey large amounts of in- 
formation if the properties of the stimulus ensembles used 
match those of natural stimuli [ fj"i"| . Here we have shown 
how to extract a stimulus ensemble that conveys the max- 
imum possible information without any prior knowledge. 
The proposed method could therefore serve to find the 
ensemble of stimuli that a given neuron naturally "ex- 
pects". Note that in contrast to previous online algo- 
rithms such as Alopex or Simplex [ p"2"| , we are not looking 
for a single optimal stimulus but rather for a complete 
ensemble of stimuli. 

Our results demonstrate also that the optimal stimu- 
lus ensemble depends on the chosen criteria about what 
aspect of the output carries the relevant information. 
Hence, if the investigated model neuron conveys informa- 



tion in its average firing rate, then it best encodes slow- 
varying current values in the range / = ... 23 //A/cm 2 . 
Synaptic inputs should therefore drive the neuron in the 
corresponding range. If, on the other hand, a neuron en- 
codes its information in the precise timing of spikes, then 
the synaptic input should be of a more binary nature and 
either fully excite or fully inhibit the neuron. Measuring 
the time courses of a neuron's membrane potential thus 
allows conclusions about the used neural code under op- 
timal conditions. 
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